Introduction to Triton Programming: The Performance Paradox: Why Correct Code Is Slow

The Performance Paradox states that a mathematically perfect kernel, such as $out = x + y$, can actually run slower than a CPU loop if the fixed costs of the GPU hardware are not amortized. This usually shows up as the Launch Tax: a fixed cost paid on every kernel launch, regardless of how much work the kernel does.

1. The "Correctness" Fallacy

Functional correctness is not an indicator of efficiency. Even if your Triton code correctly distributes work across thousands of threads, the GPU remains underutilized when the total amount of work (N) is small. The hardware then spends more time in state transitions than in actual computation.
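To make the fixed-cost argument concrete, here is a rough cost model for the vector add. It is a sketch, not a measurement: the launch cost $T_{\text{launch}} \approx 10\,\mu\text{s}$ and the memory bandwidth $B \approx 1\,\text{TB/s}$ are illustrative assumptions rather than figures from this lesson, and the 12 bytes per element assume float32 data (two 4-byte loads and one 4-byte store).

$$T(N) \approx T_{\text{launch}} + \frac{12N}{B}$$

For $N = 256$, the data term is $12 \cdot 256 / 10^{12}\,\text{s} \approx 3\,\text{ns}$, so under these assumptions more than 99.9% of the total time is launch cost. For $N = 10^8$, the data term is $1.2\,\text{ms}$ and the launch cost falls below 1% of the total.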

2. The Python Measurement Trap

Timing GPU code from Python with time.time() is dangerous. GPU calls are asynchronous: Python only enqueues the command and moves on. Without torch.cuda.synchronize(), you measure only the time it takes to enqueue the launch, not the kernel's execution. With synchronization, you measure the full launch-to-completion latency, which for small workloads is dominated by host-side launch overhead and is often around 10 times longer than the execution of the kernel itself.
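The difference between the two measurements can be sketched with a small timing helper. This is a minimal illustration under stated assumptions, not a definitive benchmarking harness: `time_call` and its `sync` parameter are hypothetical names introduced here, PyTorch's built-in `x + y` stands in for a Triton kernel, and `time.perf_counter()` replaces `time.time()` because it is the higher-resolution clock.

```python
import time


def time_call(fn, sync=None, warmup=3, iters=20):
    """Return average wall-clock seconds per call of fn.

    sync is an optional callable (e.g. torch.cuda.synchronize) invoked
    once before the clock starts and once before it stops, so that all
    queued GPU work is included in the measurement.
    """
    for _ in range(warmup):      # warm-up runs absorb one-time costs
        fn()
    if sync is not None:
        sync()                   # drain pending work before timing
    start = time.perf_counter()
    for _ in range(iters):
        fn()                     # on a GPU this only enqueues the kernel
    if sync is not None:
        sync()                   # wait until the queued kernels finish
    return (time.perf_counter() - start) / iters


if __name__ == "__main__":
    try:
        import torch
    except ImportError:
        torch = None

    if torch is not None and torch.cuda.is_available():
        x = torch.rand(1 << 24, device="cuda")
        y = torch.rand_like(x)
        add = lambda: x + y      # stand-in for a Triton kernel launch
        print("enqueue only:", time_call(add))
        print("with sync:   ", time_call(add, sync=torch.cuda.synchronize))
    else:
        print("No CUDA device available; skipping the GPU comparison.")
```

Synchronizing once around the whole loop, rather than after every call, keeps the cost of the synchronization itself from being added to each iteration.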

3. Latency vs. Throughput

To overcome this paradox, you must supply enough work to "hide" the launch latency. This is the transition from a latency-bound regime (limited by the CPU-to-GPU launch path) to a throughput-bound regime (limited by GPU memory bandwidth or compute).
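One way to picture the transition is to estimate the problem size at which moving the data takes as long as the launch itself. The sketch below does this for a float32 vector add. The function names are hypothetical, and the default launch cost (10 microseconds) and bandwidth (1 TB/s) are illustrative assumptions, not measured values; real hardware should be profiled.

```python
def break_even_n(launch_s=10e-6, bandwidth_bytes_per_s=1e12, bytes_per_elem=12):
    """N at which data movement takes as long as the fixed launch cost.

    bytes_per_elem=12 assumes float32 x + y: two 4-byte loads, one 4-byte store.
    All defaults are illustrative assumptions.
    """
    return launch_s * bandwidth_bytes_per_s / bytes_per_elem


def regime(n, **model):
    """Classify a vector add of n elements under this simple cost model."""
    return "latency-bound" if n < break_even_n(**model) else "throughput-bound"


if __name__ == "__main__":
    print(f"break-even N is roughly {break_even_n():,.0f} elements")
    for n in (256, 10**8):
        print(f"N={n:>11,}: {regime(n)}")
```

Below the break-even size the fixed launch cost dominates; well above it, the kernel's time is governed by how fast the GPU can stream data. Batching several small problems into one launch moves a workload toward the second regime.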

QUESTION 1

For each kernel, decide whether the bottleneck is likely arithmetic throughput, memory bandwidth, or launch overhead: Vector addition (N=256), Vector addition (N=10^8), and Matrix Multiplication (N=8192).

N=256: Arithmetic; N=10^8: Bandwidth; MM: Launch

N=256: Launch; N=10^8: Bandwidth; MM: Arithmetic

N=256: Bandwidth; N=10^8: Arithmetic; MM: Launch

All are compute-bound.

QUESTION 2

In the context of the Performance Paradox, what is the primary bottleneck for a 'ReLU on a matrix' operation?

Arithmetic Throughput

Memory Bandwidth

L1 Cache Size

QUESTION 3

What does the term 'Asynchronous Execution' imply regarding GPU benchmarking?

The GPU and CPU always finish at the same time.

The CPU continues to the next line of code before the GPU kernel finishes.

The kernel runs faster on smaller GPUs.

Memory transfers are blocked by compute.

QUESTION 4

Why does $out = x + y$ exhibit low arithmetic intensity?

It uses three memory accesses (2 loads, 1 store) for a single floating-point operation.

The addition operation is too complex for the ALUs.

It requires shared memory synchronization.

It only runs on one SM.

QUESTION 5

How can the 'Launch Tax' be amortized in a real-world application?

By calling the kernel more frequently with smaller data.

By increasing the workload per launch (e.g., larger N or batching).

By using 16-bit floats instead of 32-bit floats.

By disabling the L2 cache.